Description

The invention disclosed broadly relates to data
processing systems and more particularly relates to
systems and methods for enhancing fault tolerance in data
processing systems.

Operational availability is defined as follows: "If a 
stimulus to the system is processed by the system and 
the system produces a correct result within an allocated 
response time for that stimulus, and if this is true for all 
stimuli, then the system's availability is 1.0." 

It is recognized that there are many contributors to
high operational availability: (1) failures in both the
hardware system and the software system must be detected
with sufficiently high coverage to meet requirements; (2)
the inherent availability of the hardware (in terms of sim- 
ple numerical availability of its redundancy network), in- 
cluding internal and external redundancy must be higher 
than the system's required operational availability; and 
(3) failures in the software must not be visible to or ad- 
versely affect operational use of the system. This inven- 
tion addresses the third of these contributors with the 
important assumption that software failures due to de- 
sign errors and to hardware failures, will be frequent and 
hideous. 

The prior art has attempted to solve this type of 
problem by providing duplicate copies of the entire soft- 
ware component on two or more individual processors, 
and providing a means for communicating the health of 
the active processor to the standby processor. When the 
standby processor determines, through monitoring the 
health of the active processor that the standby proces- 
sor must take over operations, the standby processor 
initializes the entire software component stored therein 
and the active processor effectively terminates its oper- 
ations. The problem with this approach is that the entire 
system is involved in each recovery action. As a result, 
recovery times tend to be long, and failures in the recov- 
ery process normally render the system inoperable. In 
addition, if the redundant copies of the software systems 
are both normally operating (one as a shadow to the oth- 
er), then the effect of common-mode failures is extreme 
and also affects the whole system. 

An approach to solving the above problem is disclosed
in the Proceedings of the Ninth Symposium on Reliable
Distributed Systems, 9-11 October 1990, Alabama, in a
paper entitled 'The Design and Implementation of a Reliable
Distributed Operating System - ROSE' by T. P. Ng.
Therein proposed is a modular distributed operating 
system that provides support for building reliable appli- 
cations and handling hardware failures. The aim is to 
increase data availability despite those failures. Provid- 
ed are so-called Replicated Address Space (RAS) ob- 
jects whose content is accessible with a high probability. 
Further, a so-called Resilient Process (RP) abstraction 
allows user processes to survive hardware failures with 
minimal interruption. In particular, two different
implementations of the Resilient Processes are discussed:
one which periodically checkpoints the information about
its state in an RAS object, and the other which uses
replicated execution by executing the same code in different
nodes at the same time. 'Modularity' of the distributed
operating system in this sense means that the kernel 
layer of the operating system supports multiple process- 
es with separate address spaces. Within each process 
and its address space, multiple tasks can execute con- 
currently and communicate with one another by a 
shared memory. Several processes can replicate and 
share a portion of their address space. Several process- 
es replicate the execution of an application to provide 
the image of a single but very reliable process. The ap- 
plication may set up two processes that belong to the 
same configuration object, and share an RAS object. In 
case of a failure of one of the processes the second 
process would reconfigure to remove the first process 
from the configuration. If the reconfiguration is
successful, the second process would read the content of the
RAS object and resume processing where the first process
has left off.

It is therefore an object of the invention to increase 
the operational availability of a system of computer pro- 
grams operating in a distributed system of computers 
where recovery from a failure of software or hardware 
occurs before the failure becomes operationally visible. 

It is another object of the invention to provide fault 
tolerance in a system of computer programs operating 
in a distributed system of computers, having a high 
availability and fast recovery time. 

It is still a further object of the invention to provide 
improved operational availability in a system of compu- 
ter programs operating in a distributed system of com- 
puters, with less software complexity, than was required 
in the prior art. 

These objects, features and advantages of the
invention are achieved by the method as set forth in claim 1.
This invention provides a mechanism to organize the
computer software in such a way that its recovery from a
failure (of either itself or the hardware) occurs before the 
failure becomes operationally visible. In other words, the 
software is made to recover from the failure and reproc- 
ess or reject the stimulus so that the result is available 
to the user of the system within the specified response 
time for that type of stimulus. 

A software structure that will be referred to as an
operational unit (OU), and a related availability
management function (AMF) are the key components of the
invention. The OU and portions of the AMF are now
described. The OU concept is implemented by partitioning
as much of the system's software as possible into
independent self-contained modules whose interactions
with one another are via a network server. A stimulus
enters the system and is routed to the first module in its
thread, and from there traverses all required modules 
until an appropriate response is produced and made 
available to the system's user. 

Each module is in fact two copies of the code and
data-space of the OU. One of the copies, called the
Primary Address Space (PAS), maintains actual state data.
The other copy, called the Standby Address Space
(SAS), runs in a separate processor, and may, or may
not maintain actual state data, as described later.

The Availability Management Function (AMF) con- 
trols the allocation of PAS and SAS components to the 
processors. When the AMF detects an error, a SAS be- 
comes PAS and the original PAS is terminated. Data 
servers in the network are also informed of the change 
so that all communication will be redirected to the new 
PAS. In this fashion, system availability can be main- 
tained. 

These and other objects, features and advantages 
of the invention will be more fully appreciated with ref- 
erence to the accompanying figures. 

Fig. 1 is a timing diagram illustrating response time
allocation.

Fig. 2A is a schematic diagram of several operational
units in a network.

Fig. 2B is an alternate diagram of the operational unit
architecture illustrating the use of a data server OU by
several applications.

Fig. 2C is an illustration of the operational unit
architecture.

Fig. 2D is an illustration of the operational unit
architecture showing an example of the allocation of general
operational units across three groups.

Figs. 3A-F show the operational unit at various stages
during the initialization, operation, reconfiguration and
recovery functions.

Fig. 1 shows a timing diagram illustrating response
time allocation. The required response time (Tmax)
between a stimulus input and its required output is
suballocated so that a portion of it is available for normal
production of the response (Tnormal), a portion is available
for unpredicted resource contention (Tcontention), and
a portion is available for recovery from failures
(Trecovery). The first of these, Tnormal, is divided among the
software and hardware elements of the system in
accordance with their processing requirements. This
allocation will determine the performance needed by each
hardware and software component of the system. The
second portion, Tcontention, is never allocated. The last
portion, Trecovery, is made sufficiently long that it
includes time for error detection (including omission
failures), hardware reconfiguration, software
reconfiguration, and reproduction of the required response. A
rule of thumb is to divide the required response time Tmax
in half and subdivide the first half so that one quarter of
the required response time is available for normal
response production, Tnormal, and the second quarter is
available for unpredicted resource contention,
Tcontention. The second half of the response time, Trecovery,
is then available for failure detection and response
reproduction.
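
By way of illustration only, the rule of thumb for suballocating Tmax may be sketched in Python as follows. The names ResponseBudget, rule_of_thumb_budget and recovery_fits are hypothetical and do not appear in the disclosure; the sketch assumes only that all times are expressed in the same unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseBudget:
    """Suballocation of the required response time Tmax (hypothetical type)."""
    t_normal: float
    t_contention: float
    t_recovery: float


def rule_of_thumb_budget(t_max: float) -> ResponseBudget:
    """One quarter for normal production, one quarter for contention, one half for recovery."""
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    return ResponseBudget(t_max / 4, t_max / 4, t_max / 2)


def recovery_fits(budget: ResponseBudget, detection: float, hw_reconfig: float,
                  sw_reconfig: float, reproduction: float) -> bool:
    """Check that the recovery steps named in the text fit within Trecovery."""
    return detection + hw_reconfig + sw_reconfig + reproduction <= budget.t_recovery
```

A designer might use such a check to confirm that the measured detection, reconfiguration and reproduction times of a candidate configuration stay within the half of Tmax reserved for recovery.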

The specific problem addressed by this invention is 
how to reduce the time required for hardware and soft- 
ware reconfiguration, Trecovery, of a complex system 
to a small fraction of Tmax. This problem is solved by a 
software structure that will be referred to as an opera- 
tional unit (OU), and a related availability management 
function (AMF). The OU and portions of the AMF are 
now described. 

Referring to Fig. 2A, the OU concept is implement- 
ed by partitioning as much of the system's software as 
possible into independent self-contained modules or 
OUs 10 whose interactions with one another are via a 
network server 12. None of these modules shares data 
files with any other module, and none of them assumes 
that any other is (or is not) in the same machine as itself. 
A stimulus 14 enters the system and is routed to the first
module in its thread and from there traverses all required
modules until an appropriate response is produced
and made available to the system's user 16.
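
The routing of a stimulus through the modules of its thread may be sketched as follows. This is a minimal single-process illustration, not the disclosed implementation: the NetworkServer class, its register and send methods, and the process_stimulus function are hypothetical names, and a real network server would deliver messages between machines rather than call functions directly.

```python
from typing import Callable, Dict, List

Message = dict
Handler = Callable[[Message], Message]


class NetworkServer:
    """Hypothetical stand-in for network server 12: maps an OU name to its handler."""

    def __init__(self) -> None:
        self._routes: Dict[str, Handler] = {}

    def register(self, ou_name: str, handler: Handler) -> None:
        self._routes[ou_name] = handler

    def send(self, ou_name: str, message: Message) -> Message:
        # Callers name the OU only; they never learn which machine hosts it.
        return self._routes[ou_name](message)


def process_stimulus(server: NetworkServer, thread: List[str],
                     stimulus: Message) -> Message:
    """Pass the stimulus through each module of its thread, in order."""
    message = stimulus
    for ou_name in thread:
        message = server.send(ou_name, message)
    return message
```

Because modules address one another only by name through the server, a module makes no assumption about whether another module is in the same machine as itself.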

Each module maintains all necessary state data for 
its own operation. If two or more modules require access 
to the same state knowledge then: (1) each must
maintain the knowledge; (2) updates to that knowledge must
be transmitted among them as normal processing trans- 
actions; or (3) each must be tolerant of possible differ- 
ence between its knowledge of the state, and the knowl- 
edge of the others. This tolerance may take on several 
forms depending on the application, and may include 
mechanisms for detecting and compensating (or
correcting) for differences in state. If two modules 10
absolutely must share a common data base for performance
or other reason, then they are not "independent" and 
are combined into a single module for the purpose of 
this invention. 

It is acceptable for one module 10* to perform data 
server functions for multiple other modules (as shown 
in Fig. 2B) providing those other modules 10 can oper- 
ationally compensate for failure and loss of the server 
function. Compensate means that they continue to pro- 
vide their essential services to their clients, and that the 
inability to access the common state does not result in 
unacceptable queuing or interruption of service. Clearly 
this constrains the possible uses of such common serv- 
ers. 

Finally, a module 10 must provide predefined de- 
graded or alternative modes of operation for any case 
where another module's services are used, but where 
that other module is known to be indefinitely unavaila- 
ble. An exception to this rule is that if a server is part of 
a processor (if it deals with the allocation or use of
processor resources), then the module may have
unconditional dependencies on the server. In this case, failure
of the server is treated by the availability management
function as failure of the processor. If a module conforms
to all of the above conditions, then it becomes an OU
through the application of the following novel structuring
of its data, its logic and its role within the system. This
structure is shown in Fig. 2C.

Two complete copies of the module are loaded into
independent address spaces 20, 20' in two separate
computers 22, 22'. One of these copies is known by
the network server as the Primary Address Space (PAS)
and is the one to which all other modules' service
requests are directed. The other of these copies is called
the Standby Address Space (SAS), and is known only 
by the PAS. It is invisible to all other modules. The PAS 
sends application dependent state data to the SAS so 
that the SAS is aware of the state of the PAS. Whether 
the interface between the PAS and the SAS of an OU is 
a synchronous commit or is an unsynchronized inter- 
face is not limited by this invention, and is a trade-off 
between the steady-state response time, and the time 
required for a newly promoted PAS to synchronize with 
its servers and clients. This trade-off is discussed in 
strategy #1 below. 

The PAS maintains the state necessary for normal 
application processing by the module. The SAS main- 
tains sufficient state knowledge so that it can transition 
itself to become the PAS when and if the need should 
arise. The amount of knowledge this requires is appli- 
cation dependent, and is beyond the scope of this in- 
vention. 

Both the PAS and the SAS of an OU maintain open 
sessions with all servers of the OU. The SAS sessions 
are unused until/unless the SAS is promoted to PAS. 
When the SAS is directed by the AMF to assume the 
role of PAS, it assures that its current state is self-con- 
sistent (a failure may have resulted from the PAS send- 
ing only part of a related series of update messages), 
and then communicates with clients and servers of the 
PAS to establish state agreement with them. This may 
result in advancing the SAS's state knowledge or in roll- 
ing back the state knowledge of the clients and servers. 
Any rolling back must be recovered by reprocessing any 
affected stimuli, and/or by informing users that the stim- 
uli have been ignored. Simultaneous with this process, 
the network server is updated so that it directs all new 
or queued service requests to the SAS instead of the 
PAS. This last action constitutes promotion of the SAS 
to the position of PAS, and is followed by starting up a 
new SAS in a processor other than the one occupied by 
the new PAS. 
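
The promotion sequence just described may be sketched as follows. The sketch is illustrative only and rests on several assumptions that are not part of the disclosure: state agreement is modelled with a simple integer sequence number, disagreement is always resolved by rolling peers back (the alternative of advancing the SAS is omitted), and the names AddressSpace, NetworkServer, promote, applied_seq and partial_updates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AddressSpace:
    ou_name: str
    processor: str
    role: str                       # "PAS" or "SAS"
    applied_seq: int = 0            # last update series applied in full (assumed numbering)
    partial_updates: List[int] = field(default_factory=list)


@dataclass
class NetworkServer:
    locations: Dict[str, str] = field(default_factory=dict)

    def redirect(self, ou_name: str, processor: str) -> None:
        self.locations[ou_name] = processor


def promote(sas: AddressSpace, peer_seq: Dict[str, int], server: NetworkServer,
            spare_processor: str) -> Tuple[AddressSpace, List[int]]:
    """Promote a SAS to PAS; return the new SAS and the stimuli to reprocess."""
    if spare_processor == sas.processor:
        raise ValueError("new SAS must not share a processor with the new PAS")
    # 1. Make the standby state self-consistent: drop any partly received series.
    sas.partial_updates.clear()
    # 2. Establish agreement with clients and servers by rolling them back.
    to_reprocess = set()
    for peer, seq in list(peer_seq.items()):
        if seq > sas.applied_seq:
            to_reprocess.update(range(sas.applied_seq + 1, seq + 1))
            peer_seq[peer] = sas.applied_seq
    # 3. Redirect new and queued requests; this constitutes the promotion.
    server.redirect(sas.ou_name, sas.processor)
    sas.role = "PAS"
    # 4. Start a fresh SAS on a different processor.
    new_sas = AddressSpace(sas.ou_name, spare_processor, "SAS", sas.applied_seq)
    return new_sas, sorted(to_reprocess)
```

The returned list identifies the rolled-back stimuli, which must then either be reprocessed or reported to users as ignored.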

Several strategies for maintenance of standby data 
by the PAS are relevant to the invention. They are sum- 
marized as follows. 

Strategy #1. The SAS may retain a complete copy
of the state of the PAS. If the copy is committed to the
SAS before the PAS responds to its stimulus, then the
restart recovery will be very fast, but the response time
for all stimuli will be the longest. This approach is
superior if response time requirements allow it.

Strategy #2. The SAS may retain a "trailing" copy 
of the state of the PAS. Here, the PAS sends state up- 
dates as they occur, at the end of processing for a stim- 
ulus, or batches them for several stimuli. The SAS trails 
the PAS state in time and must therefore be concerned 
with state consistency within its data and between itself 
and its servers and clients. This yields very fast steady- 
state response time but requires moderate processing 
during failure recovery. 

Strategy #3. The SAS may retain knowledge of the 
stimuli currently in process by the PAS so that at failure, 
the exact state of the relationships between the PAS and 
its clients and servers is known by the SAS. This re- 
quires commitment of beginning and end of transactions 
between the PAS and the SAS, but reduces the inter- 
OU synchronization required during failure recovery. 

These mechanisms may be used alone or in various 
combinations by an OU. The determination of which to 
use is a function of the kinds of stimuli, the response 
time requirements, and the nature of the state retained 
by the application. 
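
The difference between a committed complete copy (Strategy #1) and a trailing copy (Strategy #2) may be sketched as follows. The Primary and Standby classes and the batch_size parameter are hypothetical illustrations, and the state is assumed, for simplicity, to be a flat key/value mapping. A batch size of one corresponds to committing to the SAS before the PAS responds; a larger batch size lets the SAS trail the PAS.

```python
from typing import Dict


class Standby:
    """Hypothetical SAS that applies whatever updates the PAS sends."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}

    def apply(self, updates: Dict[str, int]) -> None:
        self.state.update(updates)


class Primary:
    """Hypothetical PAS; batch_size=1 commits before each response."""

    def __init__(self, standby: Standby, batch_size: int = 1) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.standby = standby
        self.batch_size = batch_size
        self.state: Dict[str, int] = {}
        self._unsent: Dict[str, int] = {}
        self._stimuli_since_flush = 0

    def handle(self, key: str, value: int) -> str:
        self.state[key] = value
        self._unsent[key] = value
        self._stimuli_since_flush += 1
        if self._stimuli_since_flush >= self.batch_size:
            self.flush()            # standby is updated before the response returns
        return f"done:{key}"

    def flush(self) -> None:
        self.standby.apply(self._unsent)
        self._unsent = {}
        self._stimuli_since_flush = 0

    def lag(self) -> int:
        """Number of state items the standby has not yet received."""
        return len(self._unsent)
```

With a larger batch size the steady-state path does less work per stimulus, at the cost of a non-zero lag that must be reconciled during failure recovery.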

The Availability Management Function (AMF) 

The characteristics just described comprise an OU, 
but do not by themselves achieve availability goals. 
Availability goals are achieved by combining this OU ar- 
chitecture with an AMF that controls the state of all OU's 
in the system. The AMF has three components, each 
with its own roles in the maintenance of high availability 
of the system's OU's. The relationship between an OU 
and the AMF is illustrated in Figs. 3A through 3F and is 
described below. 

The most important AMF function is that of group 
manager. A group is a collection of similarly configured 
processors that have been placed into the network for 
the purpose of housing a predesignated set of one or 
more OU's. Each group in the system (if there are more 
than one) is managed independently of all other groups. 
In Fig. 2D, three groups are shown. Here, the OU's re- 
siding in each group (rather than the processors) are 
shown. 

The number of processors in each group, and the 
allocation of OU PAS and SAS components to those 
processors may be widely varied in accordance with the 
availability requirement, the power of each processor, 
and the processing needs of each PAS and SAS com- 
ponent. The only constraint imposed by the invention is 
that the PAS and the SAS of any single OU must be in 
two different processors if processor failures are to be 
guarded against. 
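
The single placement constraint may be illustrated with a minimal round-robin allocation sketch. The allocate function and its round-robin policy are assumptions made for illustration; the disclosure leaves the allocation policy open and requires only that the PAS and the SAS of an OU occupy different processors.

```python
from typing import Dict, List, Tuple


def allocate(ou_names: List[str], processors: List[str]) -> Dict[str, Tuple[str, str]]:
    """Return {ou: (pas_processor, sas_processor)} with the two always different."""
    if len(set(processors)) < 2 or len(set(processors)) != len(processors):
        raise ValueError("need at least two distinct processors")
    n = len(processors)
    placement = {}
    for i, ou in enumerate(ou_names):
        pas = processors[i % n]
        sas = processors[(i + 1) % n]   # neighbouring slot, never the same as pas
        placement[ou] = (pas, sas)
    return placement
```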

Group Management: Referring to Fig. 3A, the 
group manager resides in one of the processors of a 
group (the processor is determined by a protocol not in- 
cluded in this invention). It initializes the group for cold 
starts, and reconfigures the group when necessary for 
failure recovery. Within the group, each OU's PAS and
SAS exist on separate processors, and sufficient proc- 
essors exist to meet inherent availability requirements. 
The group manager monitors performance of the group 
as a unit and coordinates the detection and recovery of 
all failures within the group that do not require attention 
at a system level. This coordination includes the follow- 
ing: 

1. Commanding the takeover of an OU's functional
responsibilities by the SAS when it has been determined
that a failure has occurred in either the PAS,
or in the processor or related resource necessary
to the operation of the PAS.

2. Initiating the start-up of a new address space to
house a new SAS after a prior SAS has been promoted
to PAS.

3. Updating the network server's image of the location
of each OU as its PAS moves among the processors
of a group in response to failures or scheduled
shutdown of individual group processors.

Errors detected by the group manager include proc- 
essor failure (by heartbeat protocols between group 
members), and failures of an OU that affect both its PAS 
and SAS, for example, those caused by design errors 
in the communication between them or the common 
processing of standby/backup state data. The mecha- 
nisms for error detection are beyond the scope of the 
invention. 
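
Although the detection mechanisms themselves are outside the scope of the invention, a heartbeat protocol of the general kind mentioned above may be sketched as follows, purely for orientation. The HeartbeatMonitor class, its timeout parameter and the explicit passing of timestamps are all assumptions of this sketch.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Hypothetical monitor: a member is suspected failed when its last
    heartbeat is older than the timeout."""

    def __init__(self, timeout: float) -> None:
        if timeout <= 0:
            raise ValueError("timeout must be positive")
        self.timeout = timeout
        self._last_seen: Dict[str, float] = {}

    def beat(self, member: str, now: float) -> None:
        """Record a heartbeat received from a group member at time now."""
        self._last_seen[member] = now

    def failed(self, now: float) -> List[str]:
        """Return members whose heartbeat is overdue, in sorted order."""
        return sorted(m for m, t in self._last_seen.items()
                      if now - t > self.timeout)
```

A group manager built on such a monitor would command an SAS takeover for each OU whose PAS resides in a processor reported by failed.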

Local Management: Group level support of the OU 
architecture relies on the presence of certain functions 
within each processor of the group. These functions are 
referred to as the AMF's local manager 32. The local 
manager is implemented within each processor as an 
extension of the processor's control program and is re- 
sponsible for detecting and correcting failures that can 
be handled without intervention from a higher (group or 
above) level. The local manager maintains heartbeat 
protocol communications with each OU PAS or SAS in 
its processor to watch for abnormalities in their perform- 
ance. It also receives notification from the operating sys- 
tem of any detected machine level hardware or software 
problem. Any problem that cannot be handled locally is 
forwarded to the group manager for resolution. 

Global Management: The isolation and correction 
of system level problems and problems associated with 
the network fall to the AMF's global manager 34. The
global manager 34 correlates failures and states at the 
lower levels to detect and isolate errors in network be- 
havior, group behavior, and response times of threads 
involving multiple processors. It also interacts with hu- 
man operators of the system to deal with failures that 
the automation cannot handle. The global manager is 
designed to operate at any station in the network, and 
is itself an OU. The movement of the global manager 34
OU from one group to another, if necessary, is initiated
and monitored by the human operator using capabilities
contained in the local manager at each processor
station.
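
The division of responsibility among the local, group and global managers may be sketched as a simple escalation chain. The escalate function and the predicate-based representation of each level are hypothetical; the sketch assumes that a problem is passed upward until some level can handle it and that anything left over is referred to the human operator.

```python
from typing import Callable, List, Tuple

CanHandle = Callable[[str], bool]


def escalate(problem: str, levels: List[Tuple[str, CanHandle]]) -> str:
    """Return the name of the first level able to handle the problem."""
    for name, can_handle in levels:
        if can_handle(problem):
            return name
    # Failures that the automation cannot handle go to a person.
    return "human operator"
```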

Network Management: The network manager is a
part of the AMF global manager 34. Its functionality and
design are configuration dependent, and are beyond the
scope of this invention.

Figs. 3A-F show the functionality of the AMF during
initialization, operation, reconfiguration and recovery of
the system. In Fig. 3A, PAS and SAS are loaded and
initialized. OU locations are entered in the network
server. Fig. 3B shows the synchronization of the PAS and
SAS with clients and servers to establish consistency of
states. Fig. 3C shows steady state operation with PAS
responding to a stimulus and outputting a response. The
SAS is kept updated by the PAS.

Fig. 3D shows what happens when the PAS fails.
The old SAS is promoted to new PAS and a new SAS
is loaded and initialized. The network servers are also
updated with the new location of the OU. Fig. 3E shows
re-synchronization of the new PAS with the clients and
servers. At steady state (Fig. 3F), the new PAS is
responding to stimuli as normal.
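
The sequence of stages described for an OU may be summarized as a small state machine. The Phase names and the event strings below are hypothetical labels chosen for this sketch; only the ordering of the stages is taken from the description.

```python
from enum import Enum


class Phase(Enum):
    LOAD = "load and initialize PAS and SAS"
    SYNC = "synchronize with clients and servers"
    STEADY = "steady state operation"
    FAILOVER = "promote SAS, start new SAS, update network server"
    RESYNC = "re-synchronize new PAS with clients and servers"


# Allowed transitions, keyed by (current phase, event).
NEXT_PHASE = {
    (Phase.LOAD, "loaded"): Phase.SYNC,
    (Phase.SYNC, "synchronized"): Phase.STEADY,
    (Phase.STEADY, "pas_failed"): Phase.FAILOVER,
    (Phase.FAILOVER, "promoted"): Phase.RESYNC,
    (Phase.RESYNC, "synchronized"): Phase.STEADY,
}


def step(phase: Phase, event: str) -> Phase:
    """Advance the OU lifecycle; reject events that are invalid in this phase."""
    try:
        return NEXT_PHASE[(phase, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not valid in phase {phase.name}") from None
```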

It is the combination of the OU structure and the
recovery functions of the AMF that constitutes this
invention. By comparison, the contemporary strategy and
mechanisms of software failover/switchover deal with
units of entire processors rather than with smaller units
of software as in this invention. Because the unit of
failure and recovery achieved by this invention is small by
comparison, the time for recovery is also small.
Furthermore, the recovery from processor level failures can be
accomplished in small steps spread across several
processors (as many as are used to contain SAS's for
the PAS's contained in the failed processor). This is
crucial to maintaining continuous high availability operation
in large real time systems.
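
The spreading of recovery from a processor failure across several processors may be sketched as follows. The takeover_plan function and the placement mapping (OU name to a pair of PAS processor and SAS processor) are hypothetical; the sketch merely groups the required promotions by the processor that hosts each SAS.

```python
from typing import Dict, List, Tuple


def takeover_plan(placement: Dict[str, Tuple[str, str]],
                  failed: str) -> Dict[str, List[str]]:
    """Map each surviving processor to the OUs whose SAS it must promote."""
    plan: Dict[str, List[str]] = {}
    for ou, (pas_processor, sas_processor) in placement.items():
        if pas_processor == failed:
            plan.setdefault(sas_processor, []).append(ou)
    return plan
```

When the standbys of a failed processor's OUs are placed on different processors, each surviving processor performs only a small part of the overall recovery.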

Claims

1. A method for increasing the operational availability
of a system of computer programs operating in a
distributed system of computers, comprising the
steps of:

dividing a computer program into a plurality of
independent self-contained functional modules
(10), whose interactions with one another is via
one or more servers (12) being interconnected
in said distributed system, whereby none of
said functional modules shares data with any
other module, and whereby all of said functional
modules maintains all necessary state data for
their own operation;

controlling the state of all of said functional
modules (10) by availability management
means, which monitor the performance of said
functional modules, and which coordinate
detection and recovery of system failures;

loading a first copy of a functional module into
a first processor's (22) address space (20) and
locating a second copy of said functional module
into a second processor's (22') address
space (20'), said two address spaces being
physically separated and independent, and
said second processor's address space being
known only by the according first processor;

said first processor executing said first functional
module to send application dependent
state data to said second processor where it is
received by said second functional module
executing on said second processor;

said first processor executing said first module,
maintaining a normal application processing
state and said second processor executing said
second module, maintaining a secondary state
knowledge sufficient to enable it to become a
primary functioning module;

said first processor executing said first module
maintaining open sessions with a plurality of
said servers connected therewith and said second
processor executing said second module
maintaining a plurality of open sessions with
all of said servers in the network;

said second functional module, in response to
a stimulus requiring it to assume the role of said
first functional module, checking that its current
state is consistent with the current state of said
first functional module, followed by said second
module then communicating with said servers
in said network to establish synchronization
with the state of said servers, terminating said
first functional module and updating said servers
with the new address space;

all clients and servers connected in said network
responding to said second module assuming
the role of said first module, by directing
all new or queued service requests to said second
module instead of to said first module;

whereby said second module assumes the role
of said first module in performing primary
address space (20) operations.

2. The method of claim 1 wherein said first module and
said second module communicate synchronously.

3. The method of claim 1 wherein said first module and
said second module communicate asynchronously.

4. The method of claim 1 wherein said second module
retains a complete copy of the state of said first
module.

5. The method of claim 1 wherein said second module
retains a trailing copy of the state of said first module.

6. The method of claim 1 wherein said second module
retains knowledge of the stimuli currently in process
by said first module.



Patentansprüche (German claims, in English translation)

1. A method for increasing the operational availability
of a system of computer programs in a distributed
computer system, comprising the following steps:

dividing a computer program into numerous
independent functional modules (10), which
interact with one another via one or more servers
(12) interconnected in the distributed system,
wherein none of the functional modules shares
data with another module, and wherein all
functional modules store all the state data
necessary for their own operation;

controlling the state of all functional modules (10)
by availability management means, which monitor
the performance of the functional modules and
coordinate the detection and correction of system
failures;

loading a first copy of a functional module into an
address space (20) of a first processor (22) and
placing a second copy of the functional module
into an address space (20') of a second processor
(22'), wherein the two address spaces are
physically separate and independent of one
another, and the address space of the second
processor is known only to the corresponding
first processor;

wherein the first processor executes the first
functional module in order to send
application-dependent state data to the second
processor, where they are received by the second
functional module, which is executed on the
second processor;

wherein the first processor executes the first
module and maintains a normal application
processing state, and the second processor
executes the second module and maintains
sufficient data about the second state to enable it
to become the first functional module;

wherein the first processor, which executes the
first module, has open sessions with numerous of
the servers connected to it, and the second
processor, which executes the second module,
has numerous open sessions with all servers in
the network;

wherein the second functional module, in response
to a stimulus that causes it to assume the role of
the first functional module, checks that its current
state is consistent with the current state of the
first functional module, whereupon the second
module communicates with all servers in the
network in order to establish synchronization
with the state of the servers, after which the first
functional module is abandoned and the servers
are updated with the new address space;

wherein all clients and servers connected in the
network respond to the second module, which has
assumed the role of the first module, by directing
all new or queued service requests to the second
module instead of to the first module;

wherein the second module assumes the role of
the first module by performing operations of a
primary address space (20).

2. The method according to claim 1, in which the first
module and the second module communicate
synchronously.

3. The method according to claim 1, in which the first
module and the second module communicate
asynchronously.

4. The method according to claim 1, in which the second
module holds a complete copy of the state of the
first module.

5. The method according to claim 1, in which the second
module holds a trailing copy of the state of the
first module.

6. The method according to claim 1, in which the second
module knows the stimuli that are currently being
processed by the first module.



Revendications (French claims, in English translation)

1. A method for increasing the operational availability
of a system of computer programs operating in a
distributed computer system, comprising the steps
consisting of:

dividing a computer program into a plurality of
independent, self-contained functional modules
(10), whose mutual interactions take place via one
or more servers (12) that are interconnected in
said distributed system, in which none of said
functional modules shares data with any other
module, and in which all of said functional
modules keep all the state data necessary for
their own operation;

controlling the state of all of said functional
modules (10) by means of availability management
means, which monitor the performance of said
functional modules and which coordinate the
detection and recovery of system errors;

loading a first copy of a functional module into an
address space (20) of a first processor (22) and
placing a second copy of said functional module
into an address space (20') of a second processor
(22'), said two address spaces being physically
separate and independent and said address space
of the second processor being known only to the
corresponding first processor;

said first processor executing said first functional
module in order to send application-dependent
state data to said second processor, where they
are received by said second functional module
running on said second processor;

said first processor, executing said first module,
maintaining a normal application processing state
and said second processor, executing said second
module, maintaining knowledge of the secondary
state sufficient to enable it to become a primary
functioning module;

said first processor executing said first module
maintaining open sessions with a plurality of said
servers connected to it and said second processor
executing said second module maintaining a
plurality of open sessions with all of said servers
in the network;

said second functional module, in response to a
stimulus asking it to take on the role of said first
functional module, checking that its current state
is consistent with the current state of said first
functional module, followed by said second module
then communicating with said servers in said
network to establish synchronization with the
state of said servers, terminating said first
functional module and updating said servers with
the new address space;

all of the clients and servers that are connected in
said network responding to said second module
taking on the role of said first module, by directing
all service requests, whether new or queued, to
said second module instead of directing them to
said first module;

so that said second module takes on the role of
said first module in performing the primary
address space (20) operations.

2. The method according to claim 1, in which said first
module and said second module communicate
synchronously.

3. The method according to claim 1, in which said first
module and said second module communicate
asynchronously.

4. The method according to claim 1, in which said
second module retains a complete copy of the state
of said first module.

5. The method according to claim 1, in which said
second module retains a trailing copy of the state of
said first module.

6. The method according to claim 1, in which said
second module retains knowledge of the stimuli
currently being processed by said first module.